Amazon Fine Food Reviews Analysis

Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10

Attribute Information:

  1. Id
  2. ProductId - unique identifier for the product
  3. UserId - unqiue identifier for the user
  4. ProfileName
  5. HelpfulnessNumerator - number of users who found the review helpful
  6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
  7. Score - rating between 1 and 5
  8. Time - timestamp for the review
  9. Summary - brief summary of the review
  10. Text - text of the review

Objective:

Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).


[Q] How to determine if a review is positive or negative?

[Ans] We could use Score/Rating. A rating of 4 or 5 can be cosnidered as a positive review. A rating of 1 or 2 can be considered as negative one. A review of rating 3 is considered nuetral and such reviews are ignored from our analysis. This is an approximate and proxy way of determining the polarity (positivity/negativity) of a review.

[1]. Reading Data

[1.1] Loading the data

The dataset is available in two forms

  1. .csv file
  2. SQLite Database

In order to load the data, We have used the SQLITE dataset as it is easier to query the data and visualise the data efficiently.

Here as we only want to get the global sentiment of the recommendations (positive or negative), we will purposefully ignore all Scores equal to 3. If the score is above 3, then the recommendation wil be set to "positive". Otherwise, it will be set to "negative".

In [1]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle
from prettytable import PrettyTable  
from tqdm import tqdm_notebook
import os
In [2]:
# using SQLite Table to read data.
con = sqlite3.connect('database.sqlite') 

# filtering only positive and negative reviews i.e. 
# not taking into consideration those reviews with Score=3
# SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000, will give top 500000 data points
# you can change the number to any other number based on your computing power

# filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000""", con) 
# for tsne assignment you can take 5k data points

filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 100000""", con) 

# Give reviews with Score>3 a positive rating(1), and reviews with a score<3 a negative rating(0).
def partition(x):
    if x < 3:
        return 0
    return 1

#changing reviews with score less than 3 to be positive and vice-versa
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition) 
filtered_data['Score'] = positiveNegative
print("Number of data points in our data", filtered_data.shape)
filtered_data.head(3)
Number of data points in our data (100000, 10)
Out[2]:
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time Summary Text
0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1 1 1303862400 Good Quality Dog Food I have bought several of the Vitality canned d...
1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0 0 1346976000 Not as Advertised Product arrived labeled as Jumbo Salted Peanut...
2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres "Natalia Corres" 1 1 1 1219017600 "Delight" says it all This is a confection that has been around a fe...
In [3]:
display = pd.read_sql_query("""
SELECT UserId, ProductId, ProfileName, Time, Score, Text, COUNT(*)
FROM Reviews
GROUP BY UserId
HAVING COUNT(*)>1
""", con)
In [4]:
print(display.shape)
display.head()
(80668, 7)
Out[4]:
UserId ProductId ProfileName Time Score Text COUNT(*)
0 #oc-R115TNMSPFT9I7 B007Y59HVM Breyton 1331510400 2 Overall its just OK when considering the price... 2
1 #oc-R11D9D7SHXIJB9 B005HG9ET0 Louis E. Emory "hoppy" 1342396800 5 My wife has recurring extreme muscle spasms, u... 3
2 #oc-R11DNU2NBKQ23Z B007Y59HVM Kim Cieszykowski 1348531200 1 This coffee is horrible and unfortunately not ... 2
3 #oc-R11O5J5ZVQE25C B005HG9ET0 Penguin Chick 1346889600 5 This will be the bottle that you grab from the... 3
4 #oc-R12KPBODL2B5ZD B007OSBE1U Christopher P. Presta 1348617600 1 I didnt like this coffee. Instead of telling y... 2
In [5]:
display[display['UserId']=='AZY10LLTJ71NX']
Out[5]:
UserId ProductId ProfileName Time Score Text COUNT(*)
80638 AZY10LLTJ71NX B006P7E5ZI undertheshrine "undertheshrine" 1334707200 5 I was recommended to try green tea extract to ... 5
In [6]:
display['COUNT(*)'].sum()
Out[6]:
393063

[2] Exploratory Data Analysis

[2.1] Data Cleaning: Deduplication

It is observed (as shown in the table below) that the reviews data had many duplicate entries. Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data. Following is an example:

In [7]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductID
""", con)
display.head()
Out[7]:
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time Summary Text
0 78445 B000HDL1RQ AR5J8UI46CURR Geetha Krishnan 2 2 5 1199577600 LOACKER QUADRATINI VANILLA WAFERS DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1 138317 B000HDOPYC AR5J8UI46CURR Geetha Krishnan 2 2 5 1199577600 LOACKER QUADRATINI VANILLA WAFERS DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2 138277 B000HDOPYM AR5J8UI46CURR Geetha Krishnan 2 2 5 1199577600 LOACKER QUADRATINI VANILLA WAFERS DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3 73791 B000HDOPZG AR5J8UI46CURR Geetha Krishnan 2 2 5 1199577600 LOACKER QUADRATINI VANILLA WAFERS DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4 155049 B000PAQ75C AR5J8UI46CURR Geetha Krishnan 2 2 5 1199577600 LOACKER QUADRATINI VANILLA WAFERS DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...

As it can be seen above that same user has multiple reviews with same values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text and on doing analysis it was found that

ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)

ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8) and so on

It was inferred after analysis that reviews with same parameters other than ProductId belonged to the same product just having different flavour or quantity. Hence in order to reduce redundancy it was decided to eliminate the rows having same parameters.

The method used for the same was that we first sort the data according to ProductId and then just keep the first similar product review and delelte the others. for eg. in the above just the review for ProductId=B000HDL1RQ remains. This method ensures that there is only one representative for each product and deduplication without sorting would lead to possibility of different representatives still existing for the same product.

In [8]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
In [9]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape
Out[9]:
(87775, 10)
In [10]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100
Out[10]:
87.775

Observation:- It was also seen that in two rows given below the value of HelpfulnessNumerator is greater than HelpfulnessDenominator which is not practically possible hence these two rows too are removed from calcualtions

In [11]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND Id=44737 OR Id=64422
ORDER BY ProductID
""", con)

display.head()
Out[11]:
Id ProductId UserId ProfileName HelpfulnessNumerator HelpfulnessDenominator Score Time Summary Text
0 64422 B000MIDROQ A161DK06JJMCYF J. E. Stephens "Jeanne" 3 1 5 1224892800 Bought This for My Son at College My son loves spaghetti so I didn't hesitate or...
1 44737 B001EQ55RW A2V0I904FH7ABY Ram 3 2 4 1212883200 Pure cocoa taste with crunchy almonds inside It was almost a 'love at first bite' - the per...
In [12]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
In [13]:
#Before starting the next phase of preprocessing lets see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()
(87773, 10)
Out[13]:
1    73592
0    14181
Name: Score, dtype: int64

[3] Preprocessing

[3.1]. Preprocessing Review Text

Now that we have finished deduplication our data requires some preprocessing before we go on further with analysis and making the prediction model.

Hence in the Preprocessing phase we do the following in the order below:-

  1. Begin by removing the html tags
  2. Remove any punctuations or limited set of special characters like , or . or # etc.
  3. Check if the word is made up of english letters and is not alpha-numeric
  4. Check to see if the length of the word is greater than 2 (as it was researched that there is no adjective in 2-letters)
  5. Convert the word to lowercase
  6. Remove Stopwords
  7. Finally Snowball Stemming the word (it was obsereved to be better than Porter Stemming)

After which we collect the words used to describe positive and negative reviews

In [14]:
# printing some random reviews
sent_0 = final['Text'].values[0]
print(sent_0)
print("="*50)

sent_1000 = final['Text'].values[1000]
print(sent_1000)
print("="*50)

sent_1500 = final['Text'].values[1500]
print(sent_1500)
print("="*50)

sent_4900 = final['Text'].values[4900]
print(sent_4900)
print("="*50)
My dogs loves this chicken but its a product from China, so we wont be buying it anymore.  Its very hard to find any chicken products made in the USA but they are out there, but this one isnt.  Its too bad too because its a good product but I wont take any chances till they know what is going on with the china imports.
==================================================
The Candy Blocks were a nice visual for the Lego Birthday party but the candy has little taste to it.  Very little of the 2 lbs that I bought were eaten and I threw the rest away.  I would not buy the candy again.
==================================================
was way to hot for my blood, took a bite and did a jig  lol
==================================================
My dog LOVES these treats. They tend to have a very strong fish oil smell. So if you are afraid of the fishy smell, don't get it. But I think my dog likes it because of the smell. These treats are really small in size. They are great for training. You can give your dog several of these without worrying about him over eating. Amazon's price was much more reasonable than any other retailer. You can buy a 1 pound bag on Amazon for almost the same price as a 6 ounce bag at other retailers. It's definitely worth it to buy a big bag if your dog eats them a lot.
==================================================
In [15]:
# remove urls from text python: https://stackoverflow.com/a/40823105/4084039
sent_0 = re.sub(r"http\S+", "", sent_0)
sent_1000 = re.sub(r"http\S+", "", sent_1000)
sent_150 = re.sub(r"http\S+", "", sent_1500)
sent_4900 = re.sub(r"http\S+", "", sent_4900)

print(sent_0)
My dogs loves this chicken but its a product from China, so we wont be buying it anymore.  Its very hard to find any chicken products made in the USA but they are out there, but this one isnt.  Its too bad too because its a good product but I wont take any chances till they know what is going on with the china imports.
In [16]:
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
from bs4 import BeautifulSoup

soup = BeautifulSoup(sent_0, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1000, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1500, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_4900, 'lxml')
text = soup.get_text()
print(text)
My dogs loves this chicken but its a product from China, so we wont be buying it anymore.  Its very hard to find any chicken products made in the USA but they are out there, but this one isnt.  Its too bad too because its a good product but I wont take any chances till they know what is going on with the china imports.
==================================================
The Candy Blocks were a nice visual for the Lego Birthday party but the candy has little taste to it.  Very little of the 2 lbs that I bought were eaten and I threw the rest away.  I would not buy the candy again.
==================================================
was way to hot for my blood, took a bite and did a jig  lol
==================================================
My dog LOVES these treats. They tend to have a very strong fish oil smell. So if you are afraid of the fishy smell, don't get it. But I think my dog likes it because of the smell. These treats are really small in size. They are great for training. You can give your dog several of these without worrying about him over eating. Amazon's price was much more reasonable than any other retailer. You can buy a 1 pound bag on Amazon for almost the same price as a 6 ounce bag at other retailers. It's definitely worth it to buy a big bag if your dog eats them a lot.
In [17]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
In [18]:
sent_1500 = decontracted(sent_1500)
print(sent_1500)
print("="*50)
was way to hot for my blood, took a bite and did a jig  lol
==================================================
In [19]:
#remove words with numbers python: https://stackoverflow.com/a/18082370/4084039
sent_0 = re.sub("\S*\d\S*", "", sent_0).strip()
print(sent_0)
My dogs loves this chicken but its a product from China, so we wont be buying it anymore.  Its very hard to find any chicken products made in the USA but they are out there, but this one isnt.  Its too bad too because its a good product but I wont take any chances till they know what is going on with the china imports.
In [20]:
#remove spacial character: https://stackoverflow.com/a/5843547/4084039
sent_1500 = re.sub('[^A-Za-z0-9]+', ' ', sent_1500)
print(sent_1500)
was way to hot for my blood took a bite and did a jig lol
In [21]:
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
# <br /><br /> ==> after the above steps, we are getting "br br"
# we are including them into stop words list
# instead of <br /> if we have <br/> these tags would have revmoved in the 1st step

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])
In [22]:
# Combining all the above stundents 
from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentance in tqdm_notebook(final['Text'].values):
    sentance = re.sub(r"http\S+", "", sentance)
    sentance = BeautifulSoup(sentance, 'lxml').get_text()
    sentance = decontracted(sentance)
    sentance = re.sub("\S*\d\S*", "", sentance).strip()
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)
    # https://gist.github.com/sebleier/554280
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentance.strip())

In [23]:
preprocessed_reviews[1500]
Out[23]:
'way hot blood took bite jig lol'

[3.2] Preprocessing Review Summary

In [43]:
## Similartly you can do preprocessing for review summary also.
import warnings
warnings.filterwarnings("ignore")

from tqdm import tqdm
preprocessed_summary = []
# tqdm is for printing the status bar
for sentance in tqdm_notebook(final['Summary'].values):
    sentance = re.sub(r"http\S+", "", sentance)
    sentance = BeautifulSoup(sentance, 'lxml').get_text()
    sentance = decontracted(sentance)
    sentance = re.sub("\S*\d\S*", "", sentance).strip()
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)
    # https://gist.github.com/sebleier/554280
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
    preprocessed_summary.append(sentance.strip())

[4] Featurization

[4.1] BAG OF WORDS

In [25]:
#BoW
count_vect = CountVectorizer() #in scikit-learn
count_vect.fit(preprocessed_reviews)
print("some feature names ", count_vect.get_feature_names()[:10])
print('='*50)

final_counts = count_vect.transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_counts))
print("the shape of out text BOW vectorizer ",final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])
some feature names  ['aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaaaaaaaa', 'aaaaaaaaaaaaaaa', 'aaaaaaahhhhhh', 'aaaaaaarrrrrggghhh', 'aaaaaawwwwwwwwww', 'aaaaah']
==================================================
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (87773, 54904)
the number of unique words  54904

[4.2] Bi-Grams and n-Grams.

In [26]:
#bi-gram, tri-gram and n-gram

#removing stop words like "not" should be avoided before building n-grams
# count_vect = CountVectorizer(ngram_range=(1,2))
# please do read the CountVectorizer documentation http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

# you can choose these numebrs min_df=10, max_features=5000, of your choice
count_vect = CountVectorizer(ngram_range=(1,2), min_df=10, max_features=5000)
final_bigram_counts = count_vect.fit_transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_bigram_counts))
print("the shape of out text BOW vectorizer ",final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts.get_shape()[1])
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (87773, 5000)
the number of unique words including both unigrams and bigrams  5000

[4.3] TF-IDF

In [27]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), min_df=10)
tf_idf_vect.fit(preprocessed_reviews)
print("some sample features(unique words in the corpus)",tf_idf_vect.get_feature_names()[0:10])
print('='*50)

final_tf_idf = tf_idf_vect.transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_tf_idf))
print("the shape of out text TFIDF vectorizer ",final_tf_idf.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_tf_idf.get_shape()[1])
some sample features(unique words in the corpus) ['aa', 'aafco', 'aback', 'abandon', 'abandoned', 'abdominal', 'ability', 'able', 'able add', 'able brew']
==================================================
the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text TFIDF vectorizer  (87773, 51709)
the number of unique words including both unigrams and bigrams  51709

[4.4] Word2Vec

In [28]:
# Train your own Word2Vec model using your own text corpus
i=0
list_of_sentance=[]
for sentance in preprocessed_reviews:
    list_of_sentance.append(sentance.split())
In [29]:
# Using Google News Word2Vectors

# in this project we are using a pretrained model by google
# its 3.3G file, once you load this into your memory 
# it occupies ~9Gb, so please do this step only if you have >12G of ram
# we will provide a pickle file wich contains a dict , 
# and it contains all our courpus words as keys and  model[word] as values
# To use this code-snippet, download "GoogleNews-vectors-negative300.bin" 
# from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# it's 1.9GB in size.


# http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.W17SRFAzZPY
# you can comment this whole cell
# or change these varible according to your need

is_your_ram_gt_16g=False
want_to_use_google_w2v = False
want_to_train_w2v = True

if want_to_train_w2v:
    # min_count = 5 considers only words that occured atleast 5 times
    w2v_model=Word2Vec(list_of_sentance,min_count=5,size=50, workers=4)
    print(w2v_model.wv.most_similar('great'))
    print('='*50)
    print(w2v_model.wv.most_similar('worst'))
    
elif want_to_use_google_w2v and is_your_ram_gt_16g:
    if os.path.isfile('GoogleNews-vectors-negative300.bin'):
        w2v_model=KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
        print(w2v_model.wv.most_similar('great'))
        print(w2v_model.wv.most_similar('worst'))
    else:
        print("you don't have gogole's word2vec file, keep want_to_train_w2v = True, to train your own w2v ")
[('fantastic', 0.8577340841293335), ('awesome', 0.8323632478713989), ('good', 0.8179678916931152), ('excellent', 0.8136709332466125), ('terrific', 0.8035593628883362), ('wonderful', 0.7914960384368896), ('perfect', 0.774117112159729), ('amazing', 0.7496354579925537), ('nice', 0.7262898087501526), ('decent', 0.6892521381378174)]
==================================================
[('greatest', 0.7850901484489441), ('tastiest', 0.7464214563369751), ('best', 0.7278817892074585), ('nastiest', 0.703414261341095), ('disgusting', 0.6304017305374146), ('smoothest', 0.6098752021789551), ('awful', 0.5967636108398438), ('terrible', 0.5879918336868286), ('experienced', 0.5831197500228882), ('hottest', 0.5806283950805664)]
In [30]:
w2v_words = list(w2v_model.wv.vocab)
print("number of words that occured minimum 5 times ",len(w2v_words))
print("sample words ", w2v_words[0:50])
number of words that occured minimum 5 times  17386
sample words  ['dogs', 'loves', 'chicken', 'product', 'china', 'wont', 'buying', 'anymore', 'hard', 'find', 'products', 'made', 'usa', 'one', 'isnt', 'bad', 'good', 'take', 'chances', 'till', 'know', 'going', 'imports', 'love', 'saw', 'pet', 'store', 'tag', 'attached', 'regarding', 'satisfied', 'safe', 'infestation', 'literally', 'everywhere', 'flying', 'around', 'kitchen', 'bought', 'hoping', 'least', 'get', 'rid', 'weeks', 'fly', 'stuck', 'squishing', 'buggers', 'success', 'rate']

[4.4.1] Converting text into vectors using Avg W2V, TFIDF-W2V

[4.4.1.1] Avg W2v

In [31]:
# average Word2Vec
# compute average word2vec for each review.
sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm_notebook(list_of_sentance): # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length 50, you might need to change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))
87773
50

[4.4.1.2] TFIDF weighted W2v

In [32]:
# S = ["abc def pqr", "def def def abc", "pqr pqr def"]
model = TfidfVectorizer()
tf_idf_matrix = model.fit_transform(preprocessed_reviews)
# we are converting a dictionary with word as a key, and the idf as a value
dictionary = dict(zip(model.get_feature_names(), list(model.idf_)))
In [33]:
# TF-IDF weighted Word2Vec
tfidf_feat = model.get_feature_names() # tfidf words/col-names
# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf

tfidf_sent_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm_notebook(list_of_sentance): # for each review/sentence 
    sent_vec = np.zeros(50) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
#             tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we are 
            # dictionary[word] = idf value of word in whole courpus
            # sent.count(word) = tf valeus of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)
    row += 1

[5] Assignment 8: Decision Trees

  1. Apply Decision Trees on these feature sets
    • SET 1:Review text, preprocessed one converted into vectors using (BOW)
    • SET 2:Review text, preprocessed one converted into vectors using (TFIDF)
    • SET 3:Review text, preprocessed one converted into vectors using (AVG W2v)
    • SET 4:Review text, preprocessed one converted into vectors using (TFIDF W2v)

  2. The hyper paramter tuning (best `depth` in range [1,5,10,50,100,500,1000] , and the best `min_samples_split` in range [2,5,10,15,100,500] )
    • Find the best hyper parameter which will give the maximum AUC value
    • Find the best hyper paramter using k-fold cross validation or simple cross validation data
    • Use gridsearch cv or randomsearch cv or you can also write your own for loops to do this task of hyperparameter tuning

  3. Graphviz
    • Visualize your decision tree with Graphviz. It helps you to understand how a decision is being made, given a new vector.
    • Since feature names are not obtained from word2vec related models, visualize only BOW & TFIDF decision trees using Graphviz
    • Make sure to print the words in each node of the decision tree instead of printing its index.
    • Just for visualization purpose, limit max_depth to 2 or 3 and either embed the generated images of graphviz in your notebook, or directly upload them as .png files.

  4. Feature importance
    • Find the top 20 important features from both feature sets Set 1 and Set 2 using `feature_importances_` method of Decision Tree Classifier and print their corresponding feature names

  5. Feature engineering
    • To increase the performance of your model, you can also experiment with with feature engineering like :
      • Taking length of reviews as another feature.
      • Considering some features from review summary as well.

  6. Representation of results
    • You need to plot the performance of model both on train data and cross validation data for each hyper parameter, like shown in the figure with X-axis as min_sample_split, Y-axis as max_depth, and Z-axis as AUC Score , we have given the notebook which explains how to plot this 3d plot, you can find it in the same drive 3d_scatter_plot.ipynb
    • or


    • You need to plot the performance of model both on train data and cross validation data for each hyper parameter, like shown in the figure seaborn heat maps with rows as min_sample_split, columns as max_depth, and values inside the cell representing AUC Score
    • You choose either of the plotting techniques out of 3d plot or heat map
    • Once after you found the best hyper parameter, you need to train your model with it, and find the AUC on test data and plot the ROC curve on both train and test.
    • Along with plotting ROC curve, you need to print the confusion matrix with predicted and original labels of test data points. Please visualize your confusion matrices using seaborn heatmaps.

  7. Conclusion

Note: Data Leakage

  1. There will be an issue of data-leakage if you vectorize the entire data and then split it into train/cv/test.
  2. To avoid the issue of data-leakag, make sure to split your data first and then vectorize it.
  3. While vectorizing your data, apply the method fit_transform() on you train data, and apply the method transform() on cv/test data.
  4. For more details please go through this link.

Applying Decision Trees

[5.1] Applying Decision Trees on BOW, SET 1

In [31]:
# Please write all the code with proper documentation
import numpy as np
import pandas as pd
import math
from sklearn.model_selection  import train_test_split
from sklearn.metrics import accuracy_score
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn import tree
bow_vect=CountVectorizer()
x=preprocessed_reviews
y=np.array(final['Score'])
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)
x_train,x_cv,y_train,y_cv=train_test_split(x_train,y_train,test_size=0.3)

fbowx_tr=bow_vect.fit_transform(x_train)
fbowx_cv=bow_vect.transform(x_cv)
fbowx_te=bow_vect.transform(x_test)

std=StandardScaler(with_mean=False) #Standardizing Data
fbowx_tr=std.fit_transform(fbowx_tr)
fbowx_cv=std.transform(fbowx_cv)
fbowx_te=std.transform(fbowx_te)

dt=tree.DecisionTreeClassifier().fit(fbowx_tr,y_train)
In [32]:
depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0
    #print(d)
    for s in min_splits:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(fbowx_tr,y_train)
        prob_c=dt.predict_proba(fbowx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(fbowx_tr,y_train)
    probcv=dt.predict_proba(fbowx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(fbowx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
    
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]

plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')

plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)

Best Depth value for max auc = 50
Best split value for max auc = 500
In [36]:
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0
    for d in depths:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(fbowx_tr,y_train)
        prob_c=dt.predict_proba(fbowx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(fbowx_tr,y_train)
    probcv=dt.predict_proba(fbowx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(fbowx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
    
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')

plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()

In [37]:
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(fbowx_tr,y_train)
pred_te=dt.predict_proba(fbowx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(fbowx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)

plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
In [52]:
def find_best_threshold(threshould, fpr, tpr):
    t = threshould[np.argmax(tpr*(1-fpr))]
    # (tpr*(1-fpr)) will be maximum if your fpr is very low and tpr is very high
    print("the maximum value of tpr*(1-fpr)", max(tpr*(1-fpr)), "for threshold", np.round(t,3))
    return t

def predict_with_best_t(proba, threshould):
    predictions = []
    for i in proba:
        if i>=threshould:
            predictions.append(1)
        else:
            predictions.append(0)
    return predictions
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
In [39]:
#Comfuion matrix for Train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.6753203879622486 for threshold 0.833
Train confusion matrix
Out[39]:
Text(33,0.5,'Actual Label')
In [40]:
#Comfuion matrix for Test data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.6753203879622486 for threshold 0.833
Train confusion matrix
Out[40]:
Text(33,0.5,'Actual Label')

[5.1.1] Top 20 important features from SET 1

In [59]:
# Please write all the code with proper documentation
all_features = bow_vect.get_feature_names()
dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(fbowx_tr,y_train)

features=np.argsort(dt.feature_importances_)[::-1]
for i in features[0:20]:
    print(all_features[i])
not
great
worst
disappointed
money
horrible
good
return
best
delicious
love
waste
awful
nice
perfect
product
threw
loves
terrible
disappointing

[5.1.2] Graphviz visualization of Decision Tree on BOW, SET 1

In [35]:
from sklearn import tree
from graphviz import Source
import graphviz
feat = bow_vect.get_feature_names() 
Source(tree.export_graphviz(dt, out_file = None, feature_names = feat,max_depth=3))
#https://stackoverflow.com/questions/27817994/visualizing-decision-tree-in-scikit-learn
Out[35]:
Tree 0 not <= 0.347 gini = 0.272 samples = 43008 value = [6979, 36029] 1 disappointed <= 2.988 gini = 0.16 samples = 19636 value = [1724, 17912] 0->1 True 1196 great <= 0.855 gini = 0.349 samples = 23372 value = [5255, 18117] 0->1196 False 2 worst <= 5.87 gini = 0.152 samples = 19439 value = [1611, 17828] 1->2 1195 gini = 0.489 samples = 197 value = [113, 84] 1->1195 3 awful <= 5.036 gini = 0.148 samples = 19374 value = [1554, 17820] 2->3 1194 gini = 0.216 samples = 65 value = [57, 8] 2->1194 4 (...) 3->4 1193 (...) 3->1193 1197 money <= 2.698 gini = 0.39 samples = 17936 value = [4764, 13172] 1196->1197 2302 unfortunately <= 4.253 gini = 0.164 samples = 5436 value = [491, 4945] 1196->2302 1198 horrible <= 4.957 gini = 0.378 samples = 17242 value = [4357, 12885] 1197->1198 2297 waste <= 4.393 gini = 0.485 samples = 694 value = [407, 287] 1197->2297 1199 (...) 1198->1199 2296 (...) 1198->2296 2298 (...) 2297->2298 2301 (...) 2297->2301 2303 disappointed <= 2.988 gini = 0.156 samples = 5343 value = [454, 4889] 2302->2303 2764 gini = 0.479 samples = 93 value = [37, 56] 2302->2764 2304 (...) 2303->2304 2763 (...) 2303->2763

[5.2] Applying Decision Trees on TFIDF, SET 2

In [39]:
# Please write all the code with proper documentation
tf_vect=TfidfVectorizer(ngram_range=(1,2),min_df=10)
#tf_vect.fit(preprocessed_reviews)

ftfx_tr=tf_vect.fit_transform(x_train)
ftfx_cv=tf_vect.transform(x_cv)
ftfx_te=tf_vect.transform(x_test)

std = StandardScaler(with_mean=False)
ftfx_tr=std.fit_transform(ftfx_tr)#Standardizing Data
ftfx_cv=std.transform(ftfx_cv)
ftfx_te=std.transform(ftfx_te)

dt=tree.DecisionTreeClassifier().fit(fbowx_tr,y_train)
In [41]:
depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0
    #print(d)
    for s in min_splits:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(ftfx_tr,y_train)
        prob_c=dt.predict_proba(ftfx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(ftfx_tr,y_train)
    probcv=dt.predict_proba(ftfx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(ftfx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
    
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]

plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')

plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)

Best Depth value for max auc = 50
Best split value for max auc = 500
In [59]:
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0
    #print(d)
    for d in depths:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(ftfx_tr,y_train)
        prob_c=dt.predict_proba(ftfx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(ftfx_tr,y_train)
    probcv=dt.predict_proba(ftfx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(ftfx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
    
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')

plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()

In [44]:
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(ftfx_tr,y_train)
pred_te=dt.predict_proba(ftfx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(ftfx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)

plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
In [45]:
#Comfuion matrix for Train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.7123776078379379 for threshold 0.886
Train confusion matrix
Out[45]:
Text(33,0.5,'Actual Label')
In [46]:
#Comfuion matrix for Test data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.7123776078379379 for threshold 0.886
Train confusion matrix
Out[46]:
Text(33,0.5,'Actual Label')

[5.2.1] Top 20 important features from SET 2

In [68]:
# Please write all the code with proper documentation
all_features = tf_vect.get_feature_names()
dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(ftfx_tr,y_train)

features=np.argsort(dt.feature_importances_)[::-1]
for i in features[0:20]:
    print(all_features[i])
not
great
not buy
disappointed
worst
money
horrible
good
love
best
not disappointed
return
awful
product
terrible
not recommend
delicious
threw
bad
unfortunately

[5.2.2] Graphviz visualization of Decision Tree on TFIDF, SET 2

In [42]:
# Please write all the code with proper documentation
from sklearn import tree
from graphviz import Source
import graphviz
feat = tf_vect.get_feature_names() 
Source(tree.export_graphviz(dt, out_file = None, feature_names = feat,max_depth=3))
Out[42]:
Tree 0 not <= 1.856 gini = 0.272 samples = 43008 value = [6979, 36029] 1 worst <= 4.989 gini = 0.228 samples = 36107 value = [4746, 31361] 0->1 True 1988 great <= 0.357 gini = 0.438 samples = 6901 value = [2233, 4668] 0->1988 False 2 disappointed <= 3.491 gini = 0.222 samples = 35910 value = [4578, 31332] 1->2 1987 gini = 0.251 samples = 197 value = [168, 29] 1->1987 3 great <= 0.56 gini = 0.212 samples = 35272 value = [4245, 31027] 2->3 1982 not disappointed <= 6.783 gini = 0.499 samples = 638 value = [333, 305] 2->1982 4 (...) 3->4 1677 (...) 3->1677 1983 (...) 1982->1983 1986 (...) 1982->1986 1989 not buy <= 3.667 gini = 0.467 samples = 5500 value = [2039, 3461] 1988->1989 2192 not great <= 8.887 gini = 0.239 samples = 1401 value = [194, 1207] 1988->2192 1990 not worth <= 5.314 gini = 0.457 samples = 5249 value = [1852, 3397] 1989->1990 2191 gini = 0.38 samples = 251 value = [187, 64] 1989->2191 1991 (...) 1990->1991 2190 (...) 1990->2190 2193 waste money <= 9.302 gini = 0.213 samples = 1326 value = [161, 1165] 2192->2193 2304 gini = 0.493 samples = 75 value = [33, 42] 2192->2304 2194 (...) 2193->2194 2303 (...) 2193->2303

[5.3] Applying Decision Trees on AVG W2V, SET 3

In [47]:
# Please write all the code with proper documentation
#Avg word2vec for train data
sent_train_list=[]
for sentence in x_train:
    sent_train_list.append(sentence.split())
w2v_model=Word2Vec(sent_train_list,min_count=5,size=50, workers=4)
w2v_words = list(w2v_model.wv.vocab)

sent_train_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm_notebook(sent_train_list): # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length 50, you might need to change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_train_vectors.append(sent_vec)
print(len(sent_train_vectors))
print(len(sent_train_vectors[0]))

#Avg word2vec for cv data
sent_cv_list=[]
for sentence in x_cv:
    sent_cv_list.append(sentence.split())

sent_cv_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm_notebook(sent_cv_list): # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length 50, you might need to change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_cv_vectors.append(sent_vec)
print(len(sent_cv_vectors))
print(len(sent_cv_vectors[0]))


#Avg word2vec for test data
sent_test_list=[]
for sentence in x_test:
    sent_test_list.append(sentence.split())

sent_test_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm_notebook(sent_test_list): # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length 50, you might need to change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_test_vectors.append(sent_vec)
print(len(sent_test_vectors))
print(len(sent_test_vectors[0]))


#This code is copied and modified from :https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW#scrollTo=3-XGItt4PSx0
43008
50
18433
50
26332
50
In [48]:
aw2vx_tr=sent_train_vectors
aw2vx_cv=sent_cv_vectors
aw2vx_te=sent_test_vectors

depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0
    #print(d)
    for s in min_splits:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(aw2vx_tr,y_train)
        prob_c=dt.predict_proba(aw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(aw2vx_tr,y_train)
    probcv=dt.predict_proba(aw2vx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(aw2vx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
    
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]

plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')

plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)

Best Depth value for max auc = 500
Best split value for max auc = 500
In [51]:
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0
    #print(d)
    for d in depths:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(aw2vx_tr,y_train)
        prob_c=dt.predict_proba(aw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(aw2vx_tr,y_train)
    probcv=dt.predict_proba(aw2vx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(aw2vx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
    
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')

plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()

In [52]:
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(ftfx_tr,y_train)
pred_te=dt.predict_proba(ftfx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(ftfx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)

plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
In [53]:
#Comfuion matrix for Train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.8546495284346936 for threshold 0.886
Train confusion matrix
Out[53]:
Text(33,0.5,'Actual Label')
In [54]:
#Comfuion matrix for Test data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.8546495284346936 for threshold 0.886
Train confusion matrix
Out[54]:
Text(33,0.5,'Actual Label')

[5.4] Applying Decision Trees on TFIDF W2V, SET 4

In [46]:
# Please write all the code with proper documentation
sent_train_list=[]
for sentence in x_train:
    sent_train_list.append(sentence.split())
w2v_model=Word2Vec(sent_train_list,min_count=5,size=50, workers=4)
w2v_words = list(w2v_model.wv.vocab)
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2),min_df=10, max_features=500)
tf_idf_matrix=tf_idf_vect.fit_transform(x_train)
tfidf_feat = tf_idf_vect.get_feature_names() 
dictionary = dict(zip(tf_idf_vect.get_feature_names(), list(tf_idf_vect.idf_)))

#Train data
tfidf_sent_train_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm_notebook(sent_train_list): # for each review/sentence 
    sent_vec = np.zeros(50) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
#             tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we are 
            # dictionary[word] = idf value of word in whole courpus
            # sent.count(word) = tf valeus of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_train_vectors.append(sent_vec)
    row += 1
    
#for cv
sent_cv_list=[]
for sentence in x_cv:
    sent_cv_list.append(sentence.split())
tfidf_sent_cv_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm_notebook(sent_cv_list): # for each review/sentence 
    sent_vec = np.zeros(50) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
            # tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we are 
            # dictionary[word] = idf value of word in whole courpus
            # sent.count(word) = tf valeus of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_cv_vectors.append(sent_vec)
    row += 1
    
#Test data
sent_test_list=[]
for sentence in x_test:
    sent_test_list.append(sentence.split())
tfidf_sent_test_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm_notebook(sent_test_list): # for each review/sentence 
    sent_vec = np.zeros(50) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
            # tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we are 
            # dictionary[word] = idf value of word in whole courpus
            # sent.count(word) = tf valeus of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_test_vectors.append(sent_vec)
    row += 1



In [47]:
tfw2vx_tr=tfidf_sent_train_vectors
tfw2vx_cv=tfidf_sent_cv_vectors
tfw2vx_te=tfidf_sent_test_vectors

depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0
    #print(d)
    for s in min_splits:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(tfw2vx_tr,y_train)
        prob_c=dt.predict_proba(tfw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(tfw2vx_tr,y_train)
    probcv=dt.predict_proba(tfw2vx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(tfw2vx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
    
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]

plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')

plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)

Best Depth value for max auc = 10
Best split value for max auc = 500
In [55]:
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0
    #print(d)
    for d in depths:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(tfw2vx_tr,y_train)
        prob_c=dt.predict_proba(tfw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(tfw2vx_tr,y_train)
    probcv=dt.predict_proba(tfw2vx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(tfw2vx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
    
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')

plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()

In [50]:
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(tfw2vx_tr,y_train)
pred_te=dt.predict_proba(tfw2vx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(tfw2vx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)

plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
In [53]:
#Comfuion matrix for Train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.5385126985638196 for threshold 0.835
Train confusion matrix
Out[53]:
Text(33,0.5,'Actual Label')
In [54]:
#Comfuion matrix for Test data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.5385126985638196 for threshold 0.835
Train confusion matrix
Out[54]:
Text(33,0.5,'Actual Label')

Feature Engineering

In [90]:
for i in range(len(preprocessed_reviews)):  #considering  some features from reviw summary and length of review text
    preprocessed_reviews[i]=preprocessed_reviews[i]+ ' '+preprocessed_summary[i]+' '+str(len(final.Text.iloc[i]))
preprocessed_fe_reviews=preprocessed_reviews   
In [91]:
preprocessed_fe_reviews[1500]
Out[91]:
'way hot blood took bite jig lol hot stuff 59 hot stuff 59'

[5.1] Applying Decision Trees on BOW, SET 1

In [92]:
bow_vect=CountVectorizer()
x=preprocessed_fe_reviews
y=np.array(final['Score'])
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)
x_train,x_cv,y_train,y_cv=train_test_split(x_train,y_train,test_size=0.3)

fbowx_tr=bow_vect.fit_transform(x_train)
fbowx_cv=bow_vect.transform(x_cv)
fbowx_te=bow_vect.transform(x_test)

std=StandardScaler(with_mean=False) #Standardizing Data
fbowx_tr=std.fit_transform(fbowx_tr)
fbowx_cv=std.transform(fbowx_cv)
fbowx_te=std.transform(fbowx_te)

dt=tree.DecisionTreeClassifier().fit(fbowx_tr,y_train)
In [93]:
depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0
    #print(d)
    for s in min_splits:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(fbowx_tr,y_train)
        prob_c=dt.predict_proba(fbowx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(fbowx_tr,y_train)
    probcv=dt.predict_proba(fbowx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(fbowx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
    
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]

plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')

plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)

Best Depth value for max auc = 50
Best split value for max auc = 500
In [61]:
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0
    for d in depths:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(fbowx_tr,y_train)
        prob_c=dt.predict_proba(fbowx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(fbowx_tr,y_train)
    probcv=dt.predict_proba(fbowx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(fbowx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
    
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')

plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()

In [62]:
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(fbowx_tr,y_train)
pred_te=dt.predict_proba(fbowx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(fbowx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)

plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
In [63]:
#Comfuion matrix for Train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.7348264235302939 for threshold 0.841
Train confusion matrix
Out[63]:
Text(33,0.5,'Actual Label')
In [64]:
#Comfuion matrix for Test data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.7348264235302939 for threshold 0.841
Train confusion matrix
Out[64]:
Text(33,0.5,'Actual Label')
In [95]:
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
trace1 = go.Scatter3d(x=depths,y=min_splits,z=auc_train, name = 'train')
trace2 = go.Scatter3d(x=depths,y=min_splits,z=auc_cv, name = 'Cross validation')
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='max_depth'),
        yaxis = dict(title='max_split'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')

[5.2] Applying Decision Trees on TFIDF, SET 2

In [112]:
# Please write all the code with proper documentation
tf_vect=TfidfVectorizer(ngram_range=(1,2),min_df=10)
#tf_vect.fit(preprocessed_reviews)

ftfx_tr=tf_vect.fit_transform(x_train)
ftfx_cv=tf_vect.transform(x_cv)
ftfx_te=tf_vect.transform(x_test)

std = StandardScaler(with_mean=False)
ftfx_tr=std.fit_transform(ftfx_tr)#Standardizing Data
ftfx_cv=std.transform(ftfx_cv)
ftfx_te=std.transform(ftfx_te)
In [113]:
depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0
    #print(d)
    for s in min_splits:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(ftfx_tr,y_train)
        prob_c=dt.predict_proba(ftfx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(ftfx_tr,y_train)
    probcv=dt.predict_proba(ftfx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(ftfx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
    
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]

plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')

plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)

Best Depth value for max auc = 50
Best split value for max auc = 500
In [67]:
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0
    #print(d)
    for d in depths:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(ftfx_tr,y_train)
        prob_c=dt.predict_proba(ftfx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(ftfx_tr,y_train)
    probcv=dt.predict_proba(ftfx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(ftfx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
    
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')

plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()

In [68]:
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(ftfx_tr,y_train)
pred_te=dt.predict_proba(ftfx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(ftfx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)

plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
In [69]:
#Comfuion matrix for Train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.7628547346401273 for threshold 0.845
Train confusion matrix
Out[69]:
Text(33,0.5,'Actual Label')
In [70]:
#Comfuion matrix for Test data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.7628547346401273 for threshold 0.845
Train confusion matrix
Out[70]:
Text(33,0.5,'Actual Label')
In [114]:
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
trace1 = go.Scatter3d(x=depths,y=min_splits,z=auc_train, name = 'train')
trace2 = go.Scatter3d(x=depths,y=min_splits,z=auc_cv, name = 'Cross validation')
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='max_depth'),
        yaxis = dict(title='max_split'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')

[5.3] Applying Decision Trees on AVG W2V, SET 3

In [115]:
# Please write all the code with proper documentation
#Avg word2vec for train data
sent_train_list=[]
for sentence in x_train:
    sent_train_list.append(sentence.split())
w2v_model=Word2Vec(sent_train_list,min_count=5,size=50, workers=4)
w2v_words = list(w2v_model.wv.vocab)

sent_train_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm_notebook(sent_train_list): # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length 50, you might need to change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_train_vectors.append(sent_vec)
print(len(sent_train_vectors))
print(len(sent_train_vectors[0]))

#Avg word2vec for cv data
sent_cv_list=[]
for sentence in x_cv:
    sent_cv_list.append(sentence.split())

sent_cv_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm_notebook(sent_cv_list): # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length 50, you might need to change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_cv_vectors.append(sent_vec)
print(len(sent_cv_vectors))
print(len(sent_cv_vectors[0]))


#Avg word2vec for test data
sent_test_list=[]
for sentence in x_test:
    sent_test_list.append(sentence.split())

sent_test_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm_notebook(sent_test_list): # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length 50, you might need to change this to 300 if you use google's w2v
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_test_vectors.append(sent_vec)
print(len(sent_test_vectors))
print(len(sent_test_vectors[0]))


#This code is copied and modified from :https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW#scrollTo=3-XGItt4PSx0
43008
50
18433
50
26332
50
In [116]:
aw2vx_tr=sent_train_vectors
aw2vx_cv=sent_cv_vectors
aw2vx_te=sent_test_vectors

depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0
    #print(d)
    for s in min_splits:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(aw2vx_tr,y_train)
        prob_c=dt.predict_proba(aw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(aw2vx_tr,y_train)
    probcv=dt.predict_proba(aw2vx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(aw2vx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
    
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]

plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')

plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)

Best Depth value for max auc = 10
Best split value for max auc = 500
In [117]:
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0
    #print(d)
    for d in depths:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(aw2vx_tr,y_train)
        prob_c=dt.predict_proba(aw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(aw2vx_tr,y_train)
    probcv=dt.predict_proba(aw2vx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(aw2vx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
    
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')

plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("min_split: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()

In [74]:
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(ftfx_tr,y_train)
pred_te=dt.predict_proba(ftfx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(ftfx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)

plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
In [75]:
#Comfuion matrix for Train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.5464143712685272 for threshold 0.845
Train confusion matrix
Out[75]:
Text(33,0.5,'Actual Label')
In [76]:
#Comfuion matrix for Test data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.5464143712685272 for threshold 0.845
Train confusion matrix
Out[76]:
Text(33,0.5,'Actual Label')
In [118]:
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
trace1 = go.Scatter3d(x=depths,y=min_splits,z=auc_train, name = 'train')
trace2 = go.Scatter3d(x=depths,y=min_splits,z=auc_cv, name = 'Cross validation')
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='max_depth'),
        yaxis = dict(title='max_split'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')

[5.4] Applying Decision Trees on TFIDF W2V, SET 4

In [119]:
# Please write all the code with proper documentation
sent_train_list=[]
for sentence in x_train:
    sent_train_list.append(sentence.split())
w2v_model=Word2Vec(sent_train_list,min_count=5,size=50, workers=4)
w2v_words = list(w2v_model.wv.vocab)
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2),min_df=10, max_features=500)
tf_idf_matrix=tf_idf_vect.fit_transform(x_train)
tfidf_feat = tf_idf_vect.get_feature_names() 
dictionary = dict(zip(tf_idf_vect.get_feature_names(), list(tf_idf_vect.idf_)))

#Train data
tfidf_sent_train_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm_notebook(sent_train_list): # for each review/sentence 
    sent_vec = np.zeros(50) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
#             tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we are 
            # dictionary[word] = idf value of word in whole courpus
            # sent.count(word) = tf valeus of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_train_vectors.append(sent_vec)
    row += 1
    
#for cv
sent_cv_list=[]
for sentence in x_cv:
    sent_cv_list.append(sentence.split())
tfidf_sent_cv_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm_notebook(sent_cv_list): # for each review/sentence 
    sent_vec = np.zeros(50) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
            # tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we are 
            # dictionary[word] = idf value of word in whole courpus
            # sent.count(word) = tf valeus of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_cv_vectors.append(sent_vec)
    row += 1
    
#Test data
sent_test_list=[]
for sentence in x_test:
    sent_test_list.append(sentence.split())
tfidf_sent_test_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm_notebook(sent_test_list): # for each review/sentence 
    sent_vec = np.zeros(50) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words and word in tfidf_feat:
            vec = w2v_model.wv[word]
            # tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we are 
            # dictionary[word] = idf value of word in whole courpus
            # sent.count(word) = tf valeus of word in this review
            tf_idf = dictionary[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_test_vectors.append(sent_vec)
    row += 1



In [120]:
tfw2vx_tr=tfidf_sent_train_vectors
tfw2vx_cv=tfidf_sent_cv_vectors
tfw2vx_te=tfidf_sent_test_vectors

depths=[1,5,10,50,100,500,1000]
best_m=[]
min_splits=[2,5,10,15,100,500]
auc_train=[]
auc_cv=[]
for d in tqdm_notebook(depths):
    ms,rc=0,0
    #print(d)
    for s in min_splits:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(tfw2vx_tr,y_train)
        prob_c=dt.predict_proba(tfw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            ms=s
    dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=ms).fit(tfw2vx_tr,y_train)
    probcv=dt.predict_proba(tfw2vx_cv)[:,1]
    auc_cv.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(tfw2vx_tr)[:,1]
    auc_train.append(roc_auc_score(y_train,probtr))
    
best_depth= depths[auc_cv.index(max(auc_cv))]
best_min_split=best_m[auc_cv.index(max(auc_cv))]

plt.plot(depths, auc_train, label='Train AUC')
plt.plot(depths, auc_cv, label='CV AUC')

plt.scatter(depths, auc_train)
plt.scatter(depths, auc_cv)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs depth")
plt.grid()
plt.show()
print("Best Depth value for max auc =",best_depth)
print("Best split value for max auc =",best_min_split)

Best Depth value for max auc = 10
Best split value for max auc = 500
In [121]:
auc_train_s=[]
auc_cv_s=[]
for s in tqdm_notebook(min_splits):
    dep,rc=0,0
    #print(d)
    for d in depths:
        #print(m)
        dt=tree.DecisionTreeClassifier(max_depth=d,min_samples_split=s)
        dt.fit(tfw2vx_tr,y_train)
        prob_c=dt.predict_proba(tfw2vx_cv)[:,1]
        val=roc_auc_score(y_cv,prob_c)
        if val>rc:
            rc=val
            dep=d
    dt=tree.DecisionTreeClassifier(max_depth=dep,min_samples_split=s).fit(tfw2vx_tr,y_train)
    probcv=dt.predict_proba(tfw2vx_cv)[:,1]
    auc_cv_s.append(roc_auc_score(y_cv,probcv))
    best_m.append(ms)
    probtr=dt.predict_proba(tfw2vx_tr)[:,1]
    auc_train_s.append(roc_auc_score(y_train,probtr))
    
plt.plot(min_splits, auc_train_s, label='Train AUC')
plt.plot(min_splits, auc_cv_s, label='CV AUC')

plt.scatter(min_splits, auc_train_s)
plt.scatter(min_splits, auc_cv_s)
plt.legend()
plt.xlabel("depth: hyperparameter")
plt.ylabel("AUC")
plt.title("AUC vs min_split")
plt.grid()
plt.show()

In [80]:
#Plotting ROC_AUC curve
dt=tree.DecisionTreeClassifier(max_depth=best_depth,min_samples_split=best_min_split).fit(tfw2vx_tr,y_train)
pred_te=dt.predict_proba(tfw2vx_te)[:,1]
fpr_te, trp_te, thresholds_te = metrics.roc_curve(y_test, pred_te)
pred_tr=dt.predict_proba(tfw2vx_tr)[:,1]
fpr_tr,tpr_tr,thresholds_tr=metrics.roc_curve(y_train,pred_tr)

plt.plot(fpr_te, trp_te, label='Test ROC ,auc='+str(roc_auc_score(y_test,pred_te)))
plt.plot(fpr_tr, tpr_tr, label='Train ROC ,auc='+str(roc_auc_score(y_train,pred_tr)))
plt.title('ROC_AUC')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
plt.show()
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
In [81]:
#Comfuion matrix for Train data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_train, predict_with_best_t(pred_tr, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.5737990547715568 for threshold 0.834
Train confusion matrix
Out[81]:
Text(33,0.5,'Actual Label')
In [82]:
#Comfuion matrix for Test data
from sklearn.metrics import confusion_matrix
best_t = find_best_threshold(thresholds_tr, fpr_tr, tpr_tr)
print("Train confusion matrix")
df=pd.DataFrame(confusion_matrix(y_test, predict_with_best_t(pred_te, best_t)),index=['Negative','Positive'],columns=['Negative','Positive'])
sns.heatmap(df,annot = True,fmt='d',cmap="Blues")
plt.title('Confusion metrix')
plt.xlabel("Predicted Label")
plt.ylabel("Actual Label")
#This code is copied and modified from: https://colab.research.google.com/drive/1EkYHI-vGKnURqLL_u5LEf3yb0YJBVbZW
the maximum value of tpr*(1-fpr) 0.5737990547715568 for threshold 0.834
Train confusion matrix
Out[82]:
Text(33,0.5,'Actual Label')
In [122]:
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
trace1 = go.Scatter3d(x=depths,y=min_splits,z=auc_train, name = 'train')
trace2 = go.Scatter3d(x=depths,y=min_splits,z=auc_cv, name = 'Cross validation')
data = [trace1, trace2]

layout = go.Layout(scene = dict(
        xaxis = dict(title='max_depth'),
        yaxis = dict(title='max_split'),
        zaxis = dict(title='AUC'),))

fig = go.Figure(data=data, layout=layout)
offline.iplot(fig, filename='3d-scatter-colorscale')

[6] Conclusions

In [47]:
# Please compare all your models using Prettytable library
x=PrettyTable()
x.field_names=(['Vectorizer','Best_Depth','Best_Split','AUC','Feature Engineering'])
x.add_row(['BOW',50,500,0.822,'NO'])
x.add_row(['TF-IDF',50,500,0.812,'NO'])
x.add_row(['AW2V',500,500,0.778,'NO'])
x.add_row(['TF-IDF_w2v ',10,500,0.749,'NO'])
x.add_row(['BOW',50,500,0.857,'Yes'])
x.add_row(['TF-IDF',50,500,0.856,'Yes'])
x.add_row(['AW2V',10,500,0.807,'Yes'])
x.add_row(['TF-IDF_w2v',10,500,0.783,'Yes'])
print(x)
+-------------+------------+------------+-------+---------------------+
|  Vectorizer | Best_Depth | Best_Split |  AUC  | Feature Engineering |
+-------------+------------+------------+-------+---------------------+
|     BOW     |     50     |    500     | 0.822 |          NO         |
|    TF-IDF   |     50     |    500     | 0.812 |          NO         |
|     AW2V    |    500     |    500     | 0.778 |          NO         |
| TF-IDF_w2v  |     10     |    500     | 0.749 |          NO         |
|     BOW     |     50     |    500     | 0.857 |         Yes         |
|    TF-IDF   |     50     |    500     | 0.856 |         Yes         |
|     AW2V    |     10     |    500     | 0.807 |         Yes         |
|  TF-IDF_w2v |     10     |    500     | 0.783 |         Yes         |
+-------------+------------+------------+-------+---------------------+

After feature engineerinng there is a slight increment in the accuracy score